Skip to content

[client, relay] Advertise relay server IP via signal for foreign-relay fallback dial#6004

Open
lixmal wants to merge 7 commits intomainfrom
relay-server-ip
Open

[client, relay] Advertise relay server IP via signal for foreign-relay fallback dial#6004
lixmal wants to merge 7 commits intomainfrom
relay-server-ip

Conversation

@lixmal
Copy link
Copy Markdown
Collaborator

@lixmal lixmal commented Apr 27, 2026

Describe your changes

Problem

When two peers are on different relays and P2P (ICE) cannot be established, each learns the other peer's relay address through signal as an FQDN. If that FQDN cannot be resolved on the receiving side, the foreign-relay dial fails and the peers never reach each other.

This deadlocks on first connect when the only configured exit-node peer is the one whose relay needs to be resolved: the overlay default route is installed before the exit-node WireGuard handshake completes, so DNS goes through wt0 and fails. Management's relay cache doesn't help here, the foreign peer's home relay is only learned through signal, not from management.

Fix

Peer B already knows the IP it is connected to on its relay. We add that IP to the signal Body so peer A can use it as a dial-time fallback when the FQDN cannot be resolved. SNI/cert validation still uses the FQDN, so TLS is unaffected.

  • Add relayServerIP (bytes, tag 11) to signal Body; reserve tag 9 (previously int64 sessionId) to avoid wire-type collisions
  • Surface the underlying TCP RemoteAddr from the websocket relay conn so ConnectedIP returns the resolved IP after an FQDN dial
  • Plumb the IP as netip.Addr through Manager.OpenConn and Client with a fallback dial path that substitutes the IP into the URL while keeping the original host as SNI
  • Add unit tests for the fallback dial, ConnectedIP after an FQDN dial, and substituteHost; add an end-to-end manager test exercising the foreign-relay path

Issue ticket number and link

Stack

Checklist

  • Is it a bug fix
  • Is a typo/documentation fix
  • Is a feature enhancement
  • It is a refactor
  • Created tests that fail without the change (if possible)

By submitting this pull request, you confirm that you have read and agree to the terms of the Contributor License Agreement.

Documentation

Select exactly one:

  • I added/updated documentation for this change
  • Documentation is not needed for this change (explain why)

Internal protocol field; no user-visible behavior change beyond reachability of foreign relays when DNS for the FQDN is unavailable.

Docs PR URL (required if "docs added" is checked)

Paste the PR link from https://github.com/netbirdio/docs here:

https://github.com/netbirdio/docs/pull/__

Summary by CodeRabbit

New Features

  • Added IP-based fallback mechanism for relay server connections when DNS resolution fails, improving resilience in connectivity scenarios
  • Relay servers now exchange their connected IP addresses to enable direct IP dialing as a fallback option

Tests

  • Added integration tests validating relay server IP recovery and fallback dialing behavior

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 27, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

The changes implement relay server IP support throughout the Netbird client and relay infrastructure. The client now parses relay server IPs from signal messages, stores them in offer/answer structures, and forwards them to the relay manager. The relay manager and client support IP-based fallback dialing when FQDN resolution fails. Dialers are extended with SNI/ServerName parameters for TLS certificate verification. The signal protocol adds an optional relayServerIP field.

Changes

Cohort / File(s) Summary
Client Signal Processing
client/internal/engine.go, client/internal/peer/handshaker.go
Added RelaySrvIP field to OfferAnswer struct and implemented decodeRelayIP helper to safely parse relay server IP bytes from signal messages into netip.Addr, with validation and normalization.
Signal Message Construction
client/internal/peer/signaler.go
Refactored signalOfferAnswer to call MarshalCredential with structured CredentialPayload parameter, including newly propagated RelaySrvIP and SessionID fields.
Relay Worker
client/internal/peer/worker_relay.go, client/internal/peer/status.go
Updated RelayInstanceAddress() to return (string, netip.Addr, error) tuple; OnNewOffer now extracts and conditionally forwards relay server IP to relayManager.OpenConn.
Relay Client Core
shared/relay/client/client.go
Added NewClientWithServerIP constructor and ConnectedIP() method enabling direct IP dialing with FQDN fallback; dialDirect helper rewrites relay URL host to IP while preserving SNI via original hostname.
Relay Manager
shared/relay/client/manager.go
Extended OpenConn to accept serverIP netip.Addr parameter and forward to openConnVia; updated RelayInstanceAddress() to return connected relay IP alongside address.
QUIC Dialer
shared/relay/client/dialer/quic/quic.go
Extended Dial signature to accept serverName parameter; TLS ServerName selection now prefers explicit serverName parameter over host parsing.
Race Dialer Infrastructure
shared/relay/client/dialer/race_dialer.go
Updated DialeFn interface to include serverName parameter; added serverName field and WithServerName setter to RaceDial; parallel dial phase now passes serverName to each dialer.
WebSocket Dialer
shared/relay/client/dialer/ws/ws.go, shared/relay/client/dialer/ws/conn.go, shared/relay/client/dialer/ws/dialopts_generic.go, shared/relay/client/dialer/ws/dialopts_js.go
Extended Dial to accept serverName parameter; NewConn now takes underlying connection and computes remoteAddr from it; createDialOptions forwards serverName to HTTP client factory for TLS config.
Signal Protocol
shared/signal/proto/signalexchange.proto
Made relayServerAddress optional, added reserved field 9, and introduced new optional bytes relayServerIP = 11 field for relay server IP propagation.
Signal Marshal
shared/signal/client/client.go
Refactored MarshalCredential to accept structured CredentialPayload parameter; added CredentialPayload struct containing type, credential, Rosenpass fields, relay address/IP, and session ID; now conditionally populates proto RelayServerIP field.
Race Dialer Tests
shared/relay/client/dialer/race_dialer_test.go
Updated MockDialer.Dial signature to accept unused serverName parameter matching interface changes.
Client Server IP Tests
shared/relay/client/client_serverip_test.go
Added comprehensive integration and unit tests validating IP-based fallback dialing, ConnectedIP() parsing from various RemoteAddr formats, and substituteHost URL rewriting across schemes and IPv6 cases.
Manager Server IP Tests
shared/relay/client/manager_serverip_test.go
Added integration test demonstrating foreign-relay server IP recovery with intentionally unresolvable FQDN and connection echo validation.
Manager Tests
shared/relay/client/manager_test.go
Updated all manager test calls to match new API signatures: RelayInstanceAddress() returns tuple, OpenConn() requires netip.Addr parameter.

Sequence Diagram

sequenceDiagram
    participant Signal as Signal Server
    participant Client as Client Engine
    participant Handshaker as Handshaker
    participant Manager as Relay Manager
    participant RelayClient as Relay Client
    participant Dialer as QUIC/WS Dialer

    Signal->>Client: Signal Message (relayServerIP bytes)
    Client->>Client: convertToOfferAnswer()
    Client->>Client: decodeRelayIP() → netip.Addr
    Client->>Handshaker: buildOfferAnswer()
    Handshaker->>Handshaker: OfferAnswer.RelaySrvIP = IP
    Handshaker->>Client: return OfferAnswer with IP
    Client->>Client: signalOfferAnswer()
    Client->>Signal: MarshalCredential(CredentialPayload with RelaySrvIP)
    
    Note over Client,Dialer: Peer receives offer with relay IP
    
    Client->>Manager: OpenConn(serverAddress, peerKey, serverIP)
    Manager->>RelayClient: NewClientWithServerIP(serverURL, serverIP)
    RelayClient->>RelayClient: connect()
    RelayClient->>Dialer: dialDirect(relayURL with serverIP)
    
    alt Direct IP dial fails
        RelayClient->>Dialer: fallback to dialFQDN(relayURL)
        Dialer->>Dialer: DNS resolution
    end
    
    Dialer->>Dialer: TLS with SNI=original hostname
    Dialer->>RelayClient: return net.Conn
    RelayClient->>RelayClient: ConnectedIP() → remote IP
    RelayClient->>Manager: return connection
    Manager->>Client: return connection ready
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Suggested reviewers

  • pappz
  • pascal-fischer

Poem

🐰 Hop, hop! A relay IP hops through the wire,
No DNS needed when fallbacks inspire!
From signal to socket, the bytes find their way,
Direct dialing magic—no DNS delay!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 36.17% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely summarizes the main change: advertising relay server IP via signal for foreign-relay fallback dialing.
Description check ✅ Passed The description provides comprehensive details: problem statement, fix explanation, technical changes across files, test coverage, and documentation justification. All required sections are present.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch relay-server-ip

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@lixmal lixmal changed the title [client, signal, relay] Advertise relay server IP via signal for foreign-relay fallback dial [client, relay] Advertise relay server IP via signal for foreign-relay fallback dial Apr 27, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (5)
shared/signal/proto/signalexchange.proto (1)

68-76: LGTM — proper field deprecation and clear semantics.

Reserving tag 9 (previously int64 sessionId) is the correct way to prevent wire-incompatible reuse after the type change to optional bytes sessionId = 10. The new bytes relayServerIP = 11 documentation usefully calls out:

  • 4-byte (IPv4) / 16-byte (IPv6) encoding (matches netip.Addr.AsSlice()),
  • fallback-only usage,
  • SNI/TLS verification still bound to relayServerAddress.

One consideration: also reserved "sessionId" (the old name) alongside the tag, in case any tooling/codegen elsewhere keys off the legacy name. Optional.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@shared/signal/proto/signalexchange.proto` around lines 68 - 76, Add a
symbolic reservation for the old field name to avoid tooling/codegen reusing it:
next to the existing numeric reservation "reserved 9;" add reserved "sessionId";
so that the proto contains both reserved 9; and reserved "sessionId"; while
keeping the new optional bytes sessionId = 10 and bytes relayServerIP = 11
unchanged.
shared/relay/client/dialer/race_dialer.go (1)

46-51: Minor: fluent setter is fine, but document concurrency expectations.

WithServerName mutates the receiver in place. Given RaceDial is constructed per-dial in practice, this is safe, but a one-liner clarifying that RaceDial is not safe for concurrent reconfiguration would prevent future misuse if it's ever cached/shared.

📝 Suggested doc tweak
-// WithServerName sets a TLS SNI/cert validation override. Used when serverURL
-// contains an IP literal but the cert is issued for a different hostname.
+// WithServerName sets a TLS SNI/cert validation override. Used when serverURL
+// contains an IP literal but the cert is issued for a different hostname.
+// Not safe to call concurrently with Dial; configure before invoking Dial.
 func (r *RaceDial) WithServerName(serverName string) *RaceDial {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@shared/relay/client/dialer/race_dialer.go` around lines 46 - 51, The
WithServerName method mutates the RaceDial receiver in place and may be misused
concurrently; update the doc comment for func (r *RaceDial)
WithServerName(serverName string) *RaceDial to state that it mutates the
receiver, is intended to be used on a per-dial RaceDial instance, and is not
safe for concurrent reconfiguration (recommend creating a new RaceDial or
copying before reuse). Also mention that RaceDial is typically constructed
per-dial to make the concurrency expectation explicit.
shared/relay/client/dialer/race_dialer_test.go (1)

31-33: LGTM!

The _ blank identifier on the new serverName parameter is the right choice here since the mock's dialFunc field signature is intentionally not extended — existing tests don't exercise SNI override behavior, and the mock continues to forward only ctx/address.

Worth considering as a follow-up: a dedicated test that asserts RaceDial.WithServerName(...) propagates the override into Dial, since the mock currently ignores it. Optional, not blocking.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@shared/relay/client/dialer/race_dialer_test.go` around lines 31 - 33, The
blank identifier for the new serverName parameter in MockDialer.Dial is fine for
now, but to test SNI override behavior add a follow-up unit test: extend the
mock to capture the serverName (e.g. add or change MockDialer.dialFunc to accept
the serverName or add a separate hook), update MockDialer.Dial to forward the
serverName to that hook, then add a test calling RaceDial.WithServerName(...)
and assert the mock received the expected serverName in Dial; reference
MockDialer.Dial, MockDialer.dialFunc, RaceDial.WithServerName and Dial when
making these changes.
shared/relay/client/dialer/quic/quic.go (1)

26-45: LGTM — SNI override semantics are correct.

Explicit serverName correctly takes precedence over URL-derived SNI, and the default branch preserves the prior IP-literal-skip behavior (avoids invalid SNI for [::1]:443 / 1.2.3.4:443).

One small consideration: if a caller passes serverName that itself is an IP literal, tls.Config.ServerName will be set to an IP. Go's crypto/tls omits the SNI extension in this case while still using the IP for certificate verification. Given the documented contract on RaceDial.WithServerName ("override when serverURL contains an IP literal but cert is issued for a different hostname"), callers are expected to pass an FQDN. An optional sanity check (e.g., skip override if net.ParseIP(serverName) != nil) would harden against misuse, but the code is sound as-is.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@shared/relay/client/dialer/quic/quic.go` around lines 26 - 45, The Dial
function currently unconditionally sets tlsClientConfig.ServerName when
serverName is provided; add a sanity check so we only assign ServerName if
serverName is not an IP literal (use net.ParseIP(serverName) == nil) to avoid
forcing an IP into SNI (which crypto/tls omits); update the switch/case handling
in Dial (look for Dial, serverName, tlsClientConfig.ServerName) to skip the
override when serverName parses as an IP and leave the URL-derived behavior
intact.
shared/relay/client/client.go (1)

380-386: Consider preserving the fallback dial error for diagnostics.

When the primary dial fails and dialFallback also fails, only the primary error is propagated and fbErr is silently discarded. Because dialFallback distinguishes useful operator-facing cases ("no fallback IP configured", substituteHost parse errors, actual fallback dial failure), losing it makes diagnosing why the fallback didn't recover the connection harder in the field.

♻️ Suggested change
 	rd := dialer.NewRaceDial(c.log, dialer.DefaultConnectionTimeout, c.connectionURL, dialers...)
 	conn, err := rd.Dial(ctx)
 	if err != nil {
 		fallbackConn, fbErr := c.dialFallback(ctx, dialers)
 		if fbErr != nil {
-			return nil, err
+			return nil, fmt.Errorf("primary dial: %w; fallback: %v", err, fbErr)
 		}
 		conn = fallbackConn
 	}

Or use errors.Join(err, fbErr) if joining is preferred over wrapping.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@shared/relay/client/client.go` around lines 380 - 386, When primary dial
failure falls back into c.dialFallback, preserve the fallback error instead of
discarding fbErr; in the error path inside the block around c.dialFallback(ctx,
dialers) return a combined error that includes both err and fbErr (e.g., using
errors.Join(err, fbErr) or wrapping fbErr into err with fmt.Errorf) so callers
see why the fallback also failed; update the branch that currently does "if
fbErr != nil { return nil, err }" to return a composed error referencing err and
fbErr while keeping the same return types and behavior when fallback succeeds
and sets conn = fallbackConn.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@client/internal/engine.go`:
- Around line 2405-2418: The dialing fallback currently uses c.fallbackIP when
only checked by IsValid(), which allows unspecified addresses like 0.0.0.0 or ::
(produced by decodeRelayIP returning the zero netip.Addr) to be used; update the
guard in dialFallback (referencing c.fallbackIP and the dialFallback function in
shared/relay/client/client.go) to require a non-unspecified address by changing
the condition to also check IsUnspecified (e.g., treat as invalid if
!c.fallbackIP.IsValid() || c.fallbackIP.IsUnspecified()) so unspecified zero
addresses are not used as fallback targets.

---

Nitpick comments:
In `@shared/relay/client/client.go`:
- Around line 380-386: When primary dial failure falls back into c.dialFallback,
preserve the fallback error instead of discarding fbErr; in the error path
inside the block around c.dialFallback(ctx, dialers) return a combined error
that includes both err and fbErr (e.g., using errors.Join(err, fbErr) or
wrapping fbErr into err with fmt.Errorf) so callers see why the fallback also
failed; update the branch that currently does "if fbErr != nil { return nil, err
}" to return a composed error referencing err and fbErr while keeping the same
return types and behavior when fallback succeeds and sets conn = fallbackConn.

In `@shared/relay/client/dialer/quic/quic.go`:
- Around line 26-45: The Dial function currently unconditionally sets
tlsClientConfig.ServerName when serverName is provided; add a sanity check so we
only assign ServerName if serverName is not an IP literal (use
net.ParseIP(serverName) == nil) to avoid forcing an IP into SNI (which
crypto/tls omits); update the switch/case handling in Dial (look for Dial,
serverName, tlsClientConfig.ServerName) to skip the override when serverName
parses as an IP and leave the URL-derived behavior intact.

In `@shared/relay/client/dialer/race_dialer_test.go`:
- Around line 31-33: The blank identifier for the new serverName parameter in
MockDialer.Dial is fine for now, but to test SNI override behavior add a
follow-up unit test: extend the mock to capture the serverName (e.g. add or
change MockDialer.dialFunc to accept the serverName or add a separate hook),
update MockDialer.Dial to forward the serverName to that hook, then add a test
calling RaceDial.WithServerName(...) and assert the mock received the expected
serverName in Dial; reference MockDialer.Dial, MockDialer.dialFunc,
RaceDial.WithServerName and Dial when making these changes.

In `@shared/relay/client/dialer/race_dialer.go`:
- Around line 46-51: The WithServerName method mutates the RaceDial receiver in
place and may be misused concurrently; update the doc comment for func (r
*RaceDial) WithServerName(serverName string) *RaceDial to state that it mutates
the receiver, is intended to be used on a per-dial RaceDial instance, and is not
safe for concurrent reconfiguration (recommend creating a new RaceDial or
copying before reuse). Also mention that RaceDial is typically constructed
per-dial to make the concurrency expectation explicit.

In `@shared/signal/proto/signalexchange.proto`:
- Around line 68-76: Add a symbolic reservation for the old field name to avoid
tooling/codegen reusing it: next to the existing numeric reservation "reserved
9;" add reserved "sessionId"; so that the proto contains both reserved 9; and
reserved "sessionId"; while keeping the new optional bytes sessionId = 10 and
bytes relayServerIP = 11 unchanged.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a2bfe7a9-1d40-4e3e-98b0-ffed084511ed

📥 Commits

Reviewing files that changed from the base of the PR and between 154b816 and b9a0418.

⛔ Files ignored due to path filters (1)
  • shared/signal/proto/signalexchange.pb.go is excluded by !**/*.pb.go
📒 Files selected for processing (22)
  • client/internal/engine.go
  • client/internal/peer/handshaker.go
  • client/internal/peer/signaler.go
  • client/internal/peer/worker_relay.go
  • relay/test/benchmark_test.go
  • relay/testec2/relay.go
  • shared/relay/client/client.go
  • shared/relay/client/client_fallback_test.go
  • shared/relay/client/client_test.go
  • shared/relay/client/dialer/quic/quic.go
  • shared/relay/client/dialer/race_dialer.go
  • shared/relay/client/dialer/race_dialer_test.go
  • shared/relay/client/dialer/ws/conn.go
  • shared/relay/client/dialer/ws/dialopts_generic.go
  • shared/relay/client/dialer/ws/dialopts_js.go
  • shared/relay/client/dialer/ws/ws.go
  • shared/relay/client/manager.go
  • shared/relay/client/manager_fallback_test.go
  • shared/relay/client/manager_test.go
  • shared/relay/client/picker.go
  • shared/signal/client/client.go
  • shared/signal/proto/signalexchange.proto

Comment thread client/internal/engine.go
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (3)
shared/relay/client/dialer/race_dialer_test.go (1)

26-33: Optional: consider plumbing serverName into dialFunc for future test coverage.

The mock currently discards the new serverName parameter. That's fine for the existing race-coordination tests, but if/when behavior depends on serverName (TLS SNI override is now part of the dialer contract), tests won't be able to assert it without widening the dialFunc signature. Not a blocker for this PR.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@shared/relay/client/dialer/race_dialer_test.go` around lines 26 - 33, The
MockDialer currently ignores the new serverName parameter; update
MockDialer.dialFunc signature to func(ctx context.Context, address, serverName
string) (net.Conn, error), change the MockDialer.Dial method to accept the
serverName param and pass it through to dialFunc (i.e., return m.dialFunc(ctx,
address, serverName)), and update any test instantiations of MockDialer to
supply a matching dialFunc so tests can assert on serverName (leave protocolStr
unchanged).
shared/relay/client/dialer/ws/ws.go (1)

88-91: Set an explicit MinVersion on the WS TLS config.

Static analysis flagged this block. Since you're already amending tls.Config here to set ServerName, it's a low-cost moment to also pin the minimum TLS version (TLS 1.2 or 1.3) and avoid relying on Go's default, which has shifted between releases.

♻️ Proposed change
 		TLSClientConfig: &tls.Config{
 			RootCAs:    certPool,
 			ServerName: serverName,
+			MinVersion: tls.VersionTLS12,
 		},
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@shared/relay/client/dialer/ws/ws.go` around lines 88 - 91, The TLS config in
the WS dialer currently sets RootCAs and ServerName but doesn't pin a minimum
TLS version; update the tls.Config used for TLSClientConfig (in ws.go where
TLSClientConfig is constructed) to include MinVersion: tls.VersionTLS12 (or
tls.VersionTLS13 if you prefer stricter policy) so the client refuses older
protocol versions; ensure you reference the tls.Config used in the WS dialer
initialization (the TLSClientConfig field) when adding this option.
shared/relay/client/client.go (1)

421-447: Minor: SNI with IP-literal serverName won't help cert validation.

When serverURL is rels://[2001:db8::5]:443 (IPv6 literal), substituteHost returns serverName == "2001:db8::5", which WithServerName will set as tls.Config.ServerName. Per RFC 6066 SNI must be a DNS name; Go's TLS stack will then verify the peer cert against that IP via x509.Certificate.VerifyHostname, which generally requires an IP SAN. In practice this fallback path is only exercised when the primary FQDN dial fails, so an IP-literal serverURL shouldn't reach here — but consider rejecting an IP-literal origHost early (or skipping WithServerName in that case) to avoid silently producing an unverifiable TLS config.

Not blocking; flagging since the unit test pins this behavior.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@shared/relay/client/client.go` around lines 421 - 447, substituteHost
currently returns origHost even when it's an IP literal, which leads callers to
set tls.Config.ServerName to an IP (invalid for SNI). Change substituteHost to
detect and treat IP-literal original hosts specially: parse origHost (e.g., via
netip.ParseAddr or net.ParseIP) and if it is an IP literal return an empty
origHost (or otherwise signal “no server name”) so callers will skip
WithServerName; otherwise return the hostname as before. Update substituteHost
to document that an empty origHost means “do not set TLS ServerName.”
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@shared/relay/client/client_fallback_test.go`:
- Line 24: Replace the hardcoded listenAddr ("127.0.0.1:50301") with a dynamic
bind ("127.0.0.1:0") so the OS chooses an available port; after creating the
listener, read the real address via listener.Addr().String() (or extract
host/port) and use that value when building the server URL/ExposedAddress and
any client dial targets (references: listenAddr variable, ExposedAddress, and
the test setup in client_fallback_test.go), ensuring the test no longer assumes
a fixed port.

---

Nitpick comments:
In `@shared/relay/client/client.go`:
- Around line 421-447: substituteHost currently returns origHost even when it's
an IP literal, which leads callers to set tls.Config.ServerName to an IP
(invalid for SNI). Change substituteHost to detect and treat IP-literal original
hosts specially: parse origHost (e.g., via netip.ParseAddr or net.ParseIP) and
if it is an IP literal return an empty origHost (or otherwise signal “no server
name”) so callers will skip WithServerName; otherwise return the hostname as
before. Update substituteHost to document that an empty origHost means “do not
set TLS ServerName.”

In `@shared/relay/client/dialer/race_dialer_test.go`:
- Around line 26-33: The MockDialer currently ignores the new serverName
parameter; update MockDialer.dialFunc signature to func(ctx context.Context,
address, serverName string) (net.Conn, error), change the MockDialer.Dial method
to accept the serverName param and pass it through to dialFunc (i.e., return
m.dialFunc(ctx, address, serverName)), and update any test instantiations of
MockDialer to supply a matching dialFunc so tests can assert on serverName
(leave protocolStr unchanged).

In `@shared/relay/client/dialer/ws/ws.go`:
- Around line 88-91: The TLS config in the WS dialer currently sets RootCAs and
ServerName but doesn't pin a minimum TLS version; update the tls.Config used for
TLSClientConfig (in ws.go where TLSClientConfig is constructed) to include
MinVersion: tls.VersionTLS12 (or tls.VersionTLS13 if you prefer stricter policy)
so the client refuses older protocol versions; ensure you reference the
tls.Config used in the WS dialer initialization (the TLSClientConfig field) when
adding this option.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 72051d79-50ab-46a4-86d4-fa15b6865bbe

📥 Commits

Reviewing files that changed from the base of the PR and between b9a0418 and ca5c6a9.

⛔ Files ignored due to path filters (1)
  • shared/signal/proto/signalexchange.pb.go is excluded by !**/*.pb.go
📒 Files selected for processing (22)
  • client/internal/engine.go
  • client/internal/peer/handshaker.go
  • client/internal/peer/signaler.go
  • client/internal/peer/worker_relay.go
  • relay/test/benchmark_test.go
  • relay/testec2/relay.go
  • shared/relay/client/client.go
  • shared/relay/client/client_fallback_test.go
  • shared/relay/client/client_test.go
  • shared/relay/client/dialer/quic/quic.go
  • shared/relay/client/dialer/race_dialer.go
  • shared/relay/client/dialer/race_dialer_test.go
  • shared/relay/client/dialer/ws/conn.go
  • shared/relay/client/dialer/ws/dialopts_generic.go
  • shared/relay/client/dialer/ws/dialopts_js.go
  • shared/relay/client/dialer/ws/ws.go
  • shared/relay/client/manager.go
  • shared/relay/client/manager_fallback_test.go
  • shared/relay/client/manager_test.go
  • shared/relay/client/picker.go
  • shared/signal/client/client.go
  • shared/signal/proto/signalexchange.proto
✅ Files skipped from review due to trivial changes (3)
  • shared/signal/proto/signalexchange.proto
  • client/internal/peer/worker_relay.go
  • shared/relay/client/client_test.go
🚧 Files skipped from review as they are similar to previous changes (6)
  • client/internal/peer/signaler.go
  • shared/relay/client/dialer/ws/conn.go
  • client/internal/engine.go
  • shared/signal/client/client.go
  • shared/relay/client/dialer/ws/dialopts_generic.go
  • shared/relay/client/manager.go

Comment thread shared/relay/client/client_fallback_test.go Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (4)
shared/relay/client/dialer/ws/ws.go (1)

87-92: Consider setting an explicit TLS MinVersion.

The relay server does not explicitly set MinVersion in its TLS configuration, relying on Go's default minimum (TLS 1.2). For consistency and to make security requirements explicit, pin the client to tls.VersionTLS12 in this configuration. This prevents reliance on Go's defaults which could theoretically change in future versions, and ensures compatibility with relay servers that don't enforce TLS 1.3.

♻️ Suggested change
 		TLSConfig: &tls.Config{
 			RootCAs:    certPool,
 			ServerName: serverName,
+			MinVersion: tls.VersionTLS12,
 		},
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@shared/relay/client/dialer/ws/ws.go` around lines 87 - 92, The TLS client
configuration currently sets RootCAs and ServerName but omits an explicit
MinVersion; update the TLSClientConfig (the struct that uses certPool and
serverName) to include MinVersion: tls.VersionTLS12 so the client pins the
minimum TLS version to TLS 1.2 (use the tls package constant tls.VersionTLS12 in
the TLSClientConfig initialization).
shared/relay/client/manager_fallback_test.go (2)

96-145: LGTM on the round-trip flow.

Subtest correctly drives the foreign-relay path: Bob registers on his relay, Alice dials the broken FQDN with Bob's advertised IP and the payload echoes back. The bobSideCh goroutine is safely bounded by ctx (cancelled in t.Cleanup via mCancel), so a t.Fatalf on the Alice side won't permanently leak the reader.

One small nit: the time.After(5*time.Second) timeout here can outlive the parent ctx deadline (20s less time already spent). That's fine in practice but consider <-ctx.Done() instead so any future tightening of the parent timeout propagates.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@shared/relay/client/manager_fallback_test.go` around lines 96 - 145, The test
"fallback IP recovers" uses time.After(5 * time.Second) which can outlive the
parent test context; change the select waiting for bobSideCh to instead listen
on <-ctx.Done() so the parent context deadline/cancellation propagates (look for
the select that reads from bobSideCh in the subtest and replace the time.After
branch with a ctx.Done() branch that fails the test, e.g., t.Fatalf with
ctx.Err()).

23-23: Use ephemeral ports to prevent flakiness from port collisions.

The hardcoded ports 52401 and 52402 can cause test failures if another process binds them or if the test runs in parallel with other instances. The same package already provides a freeAddr() helper in client_fallback_test.go that obtains an ephemeral port via net.Listen("tcp", "127.0.0.1:0"). Use this pattern to dynamically allocate ports and read the bound addresses from the servers, avoiding TIME_WAIT collisions and making the test safely repeatable.

Also applies to: 40-40, 82-82

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@shared/relay/client/manager_fallback_test.go` at line 23, Replace the
hardcoded ports in ListenerConfig instances with addresses obtained from the
existing freeAddr() helper used in client_fallback_test.go: call freeAddr() to
get an ephemeral address, assign it to the Address field of
server.ListenerConfig (e.g., where homeCfg := server.ListenerConfig{...} and
other configs are created), and use the bound address returned by the listeners
when starting servers so tests use dynamic ephemeral ports instead of
"127.0.0.1:52401"/"52402"; ensure all three occurrences (the homeCfg creation
and the two other ListenerConfig usages) are updated to use freeAddr() and the
resulting address variables.
shared/signal/client/client.go (1)

74-96: LGTM — payload struct + relay IP encoding.

Gating on IsValid() and using Unmap().AsSlice() is the right choice: an IPv4-mapped IPv6 (e.g. ::ffff:1.2.3.4) is normalized to a 4-byte slice. The decoder (decodeRelayIP in client/internal/engine.go) already handles both 4- and 16-byte inputs correctly via netip.AddrFromSlice, which the function's docstring explicitly documents.

Optional nit: p.Credential is dereferenced unconditionally on line 84; if a future caller passes a zero CredentialPayload{} this will panic. A nil-guard returning a clear error would harden the API without changing happy-path behavior.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@shared/signal/client/client.go` around lines 74 - 96, MarshalCredential
dereferences p.Credential without checking for nil which can panic if a caller
passes a zero CredentialPayload; add a nil-guard at the top of MarshalCredential
(the function in shared/signal/client/client.go) that checks if p.Credential ==
nil and returns nil plus a clear error (e.g., fmt.Errorf("missing credential in
CredentialPayload")) so callers get a descriptive error instead of a panic,
keeping the current return signature (*proto.Message, error).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@shared/relay/client/dialer/ws/ws.go`:
- Around line 87-92: The TLS client configuration currently sets RootCAs and
ServerName but omits an explicit MinVersion; update the TLSClientConfig (the
struct that uses certPool and serverName) to include MinVersion:
tls.VersionTLS12 so the client pins the minimum TLS version to TLS 1.2 (use the
tls package constant tls.VersionTLS12 in the TLSClientConfig initialization).

In `@shared/relay/client/manager_fallback_test.go`:
- Around line 96-145: The test "fallback IP recovers" uses time.After(5 *
time.Second) which can outlive the parent test context; change the select
waiting for bobSideCh to instead listen on <-ctx.Done() so the parent context
deadline/cancellation propagates (look for the select that reads from bobSideCh
in the subtest and replace the time.After branch with a ctx.Done() branch that
fails the test, e.g., t.Fatalf with ctx.Err()).
- Line 23: Replace the hardcoded ports in ListenerConfig instances with
addresses obtained from the existing freeAddr() helper used in
client_fallback_test.go: call freeAddr() to get an ephemeral address, assign it
to the Address field of server.ListenerConfig (e.g., where homeCfg :=
server.ListenerConfig{...} and other configs are created), and use the bound
address returned by the listeners when starting servers so tests use dynamic
ephemeral ports instead of "127.0.0.1:52401"/"52402"; ensure all three
occurrences (the homeCfg creation and the two other ListenerConfig usages) are
updated to use freeAddr() and the resulting address variables.

In `@shared/signal/client/client.go`:
- Around line 74-96: MarshalCredential dereferences p.Credential without
checking for nil which can panic if a caller passes a zero CredentialPayload;
add a nil-guard at the top of MarshalCredential (the function in
shared/signal/client/client.go) that checks if p.Credential == nil and returns
nil plus a clear error (e.g., fmt.Errorf("missing credential in
CredentialPayload")) so callers get a descriptive error instead of a panic,
keeping the current return signature (*proto.Message, error).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 3c348c82-145c-468a-b35a-556c33dea40c

📥 Commits

Reviewing files that changed from the base of the PR and between ca5c6a9 and e7bd62f.

⛔ Files ignored due to path filters (1)
  • shared/signal/proto/signalexchange.pb.go is excluded by !**/*.pb.go
📒 Files selected for processing (22)
  • client/internal/engine.go
  • client/internal/peer/handshaker.go
  • client/internal/peer/signaler.go
  • client/internal/peer/worker_relay.go
  • relay/test/benchmark_test.go
  • relay/testec2/relay.go
  • shared/relay/client/client.go
  • shared/relay/client/client_fallback_test.go
  • shared/relay/client/client_test.go
  • shared/relay/client/dialer/quic/quic.go
  • shared/relay/client/dialer/race_dialer.go
  • shared/relay/client/dialer/race_dialer_test.go
  • shared/relay/client/dialer/ws/conn.go
  • shared/relay/client/dialer/ws/dialopts_generic.go
  • shared/relay/client/dialer/ws/dialopts_js.go
  • shared/relay/client/dialer/ws/ws.go
  • shared/relay/client/manager.go
  • shared/relay/client/manager_fallback_test.go
  • shared/relay/client/manager_test.go
  • shared/relay/client/picker.go
  • shared/signal/client/client.go
  • shared/signal/proto/signalexchange.proto
✅ Files skipped from review due to trivial changes (4)
  • shared/relay/client/dialer/race_dialer_test.go
  • shared/relay/client/dialer/ws/dialopts_generic.go
  • shared/relay/client/dialer/quic/quic.go
  • shared/relay/client/client_test.go
🚧 Files skipped from review as they are similar to previous changes (6)
  • shared/relay/client/picker.go
  • shared/signal/proto/signalexchange.proto
  • relay/test/benchmark_test.go
  • client/internal/peer/signaler.go
  • shared/relay/client/client_fallback_test.go
  • shared/relay/client/client.go

// encoded as 4 bytes (IPv4) or 16 bytes (IPv6). Receivers may use it as a
// fallback dial target when DNS resolution of relayServerAddress fails.
// SNI/TLS verification still uses relayServerAddress.
bytes relayServerIP = 11;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not optional? (realayServerAddress also should be optional)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't really make a difference for bytes, but I'll add it for consistency

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Comment thread shared/relay/client/manager.go Outdated
// RelayInstanceIP returns the IP address of the live home relay connection.
// Zero value if not connected. Sent alongside RelayInstanceAddress so remote
// peers can dial directly without their own DNS lookup.
func (m *Manager) RelayInstanceIP() netip.Addr {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Return the RelayInstanceAddress() along with this variable next to the domain. We can avoid a rare inconsistent state during a relay reconnection.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Comment thread shared/relay/client/client.go Outdated
func NewClient(serverURL string, authTokenStore *auth.TokenStore, peerID string, mtu uint16) *Client {
// is called. fallbackIP, when valid, is used as a dial-time fallback if the FQDN-based dial fails. TLS
// verification still uses the FQDN from serverURL via SNI.
func NewClient(serverURL string, fallbackIP netip.Addr, authTokenStore *auth.TokenStore, peerID string, mtu uint16) *Client {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please create a new constructor with the extended parameter because the new parameter is not mandatory for all cases. NewClientForForeginServer or NewClientWithFallbackIp.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

lixmal added 4 commits April 29, 2026 06:25
Mark relayServerAddress and relayServerIP as optional in the signal proto, return the relay instance address and IP atomically from Manager.RelayInstanceAddress to avoid divergence across reconnections, and split the relay client constructor into NewClient and NewClientWithServerIP. Rename fallback terminology to server IP throughout.
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@shared/relay/client/client_serverip_test.go`:
- Around line 268-279: The helper freeAddr creates a bind-race by closing the
listener before the test server binds; change the approach so the listener is
not closed and the test server uses the existing listener. Modify freeAddr (or
replace it) to return the active net.Listener (and its port/address) instead of
closing it, and update callers/tests that currently call freeAddr to accept and
pass that listener into the server startup (or call server.Serve(listener)) so
the port remains reserved until the server owns it; ensure the function
signature and all references to freeAddr are updated accordingly.

In `@shared/relay/client/manager_serverip_test.go`:
- Around line 126-131: The test currently calls aliceConn.Read into buf and
assumes it fills the entire payload, which can flake on short reads; update the
read logic in manager_serverip_test.go (the aliceConn/read of buf for payload)
to either use io.ReadFull(aliceConn, buf) and handle its error, or capture the
returned n from aliceConn.Read and compare only buf[:n] to payload (or loop
until total bytes read equals len(payload)); ensure any error from io.ReadFull
or Read is properly checked and fails the test with a clear message.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 89ed7ad4-34ba-40d0-889d-df262ce03700

📥 Commits

Reviewing files that changed from the base of the PR and between e7bd62f and b5ed7c9.

⛔ Files ignored due to path filters (1)
  • shared/signal/proto/signalexchange.pb.go is excluded by !**/*.pb.go
📒 Files selected for processing (12)
  • client/internal/engine.go
  • client/internal/peer/handshaker.go
  • client/internal/peer/status.go
  • client/internal/peer/worker_relay.go
  • relay/test/benchmark_test.go
  • shared/relay/client/client.go
  • shared/relay/client/client_serverip_test.go
  • shared/relay/client/manager.go
  • shared/relay/client/manager_serverip_test.go
  • shared/relay/client/manager_test.go
  • shared/signal/client/client.go
  • shared/signal/proto/signalexchange.proto
🚧 Files skipped from review as they are similar to previous changes (4)
  • relay/test/benchmark_test.go
  • shared/signal/proto/signalexchange.proto
  • shared/signal/client/client.go
  • shared/relay/client/manager_test.go

Comment thread shared/relay/client/client_serverip_test.go
Comment thread shared/relay/client/manager_serverip_test.go
@sonarqubecloud
Copy link
Copy Markdown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants